Deriving Dynamics of Web Pages: A Survey

نویسندگان

  • Marilena Oita
  • Pierre Senellart
چکیده

The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up-to-date with the current version of a Web page, all the more so since not all apparent changes are significant ones. We review major approaches to change detection in Web pages and extraction of temporal properties (especially, timestamps) of Web pages. We focus our attention on techniques and systems that have been proposed in the last ten years and we analyze them to get some insight into the practical solutions and best practices available. We aim at providing an analytical view of the range of methods that can be used, distinguishing them on several dimensions, especially, their static or dynamic nature, the modeling of Web pages, or, for dynamic methods relying on comparison of successive versions of a page, the similarity metrics used. We advocate for more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ارزیابی کیفیت صفحات‌ وب پژوهشگاه‌های وابسته به وزارت علوم، تحقیقات و فن‌آوری‌ مستقر در شهر تهران از دیدگاه کاربران

Especially in research centers, evaluating the quality of web pages from clients' point of view has a constructive role in their design and development, since it makes the web developers familiar with client's perspective and assists them in designing client-oriented web sites in scientific and research environment. As a model for assessing the quality of web pages, "webQual" attempts to provid...

متن کامل

Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...

متن کامل

T-Rank: Time-Aware Authority Ranking

Analyzing the link structure of the web for deriving a page’s authority and implied importance has deeply affected the way information providers create and link content, the ranking in web search engines, and the users’ access behavior. Due to the enormous dynamics of the web, with millions of pages created, updated, deleted, and linked to every day, timeliness of web pages and links is a cruci...

متن کامل

Inférer des Objets Sémantiques du Web Structuré

This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose, and other statistics. Next, in the context of dynamically-generated Web...

متن کامل

بررسی ارتباط بین کیفیت اطلاعات و شاخص های ظاهری در صفحات وب فارسی مرتبط با حوزه سلامت عمومی

  Introduction: One approach to evaluate the quality of a web page is to investigate its external markers. The purpose of the present study is to determine the relationship between information quality of Persian public health web pages and their external quality.   Methods: The samples of this correlation study were selected from among the freely available ten-key word texts of chronic diseases...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011